OPR
The ability to reproduce a parallel execution is desirable for debugging and program reliability purposes. In debugging (13), the programmer needs to manually step back in time, while for resilience (6) this is performed automatically by the application upon failure. To be useful, replay has to faithfully reproduce the original execution. For parallel programs, the main challenge is inferring and maintaining the order of conflicting operations (data races). Deterministic record-and-replay (R&R) techniques have been developed for multithreaded shared memory programs (5), as well as for distributed memory programs (14). Our main interest is techniques for large-scale scientific programming models (3; 4).
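As a toy illustration of the record-and-replay idea, the Python sketch below records the acquisition order of one shared lock during a "record" run and enforces that same order during a "replay" run. The class and its API are illustrative assumptions, not the paper's implementation.

```python
# Minimal record & replay sketch: the conflicting operations are
# acquisitions of a single lock; recording logs their order, and
# replay enforces it. Illustrative only, not the paper's system.
import threading

class ReplayLock:
    def __init__(self, mode="record", log=None):
        self.mode = mode                      # "record" or "replay"
        self.inner = threading.Lock()
        self.log = log if log is not None else []  # thread ids, in order
        self.cursor = 0                       # next recorded acquisition
        self.cond = threading.Condition()

    def acquire(self):
        if self.mode == "record":
            self.inner.acquire()
            self.log.append(threading.get_ident())  # remember the order
        else:
            with self.cond:                   # wait for this thread's turn
                while self.log[self.cursor] != threading.get_ident():
                    self.cond.wait()
            self.inner.acquire()

    def release(self):
        self.inner.release()
        if self.mode == "replay":
            with self.cond:
                self.cursor += 1              # hand the turn to the next
                self.cond.notify_all()        # recorded acquirer
```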
QFAST: Conflating Search and Numerical Optimization for Scalable Quantum Circuit Synthesis
We present a quantum synthesis algorithm designed to produce short circuits
and to scale well in practice. The main contribution is a novel representation
of circuits able to encode placement and topology using generic "gates", which
allows the QFAST algorithm to replace expensive searches over circuit
structures with a few steps of numerical optimization. When compared against
optimal-depth, search-based state-of-the-art techniques, QFAST produces
comparable results: circuits 1.19x longer for up to four qubits, with an
increase in compilation speed of 3.6x. In addition, QFAST scales up to seven qubits.
When compared with the state-of-the-art "rule"-based decomposition techniques
in Qiskit, QFAST produces circuits shorter by up to two orders of magnitude
(331x), albeit 5.6x slower. We also demonstrate composability with other
techniques and the tunability of our formulation in terms of circuit depth and
running time.
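To make the core idea concrete, here is a minimal sketch (not QFAST itself) of replacing structural search with numerical optimization: the parameters of a fixed single-qubit template are fitted to a target unitary by minimizing a phase-invariant distance with SciPy.

```python
# Toy version of synthesis-as-optimization: fit an Rz-Rx-Rz template to a
# target unitary by minimizing a phase-invariant Hilbert-Schmidt distance.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def circuit(params):
    # Rz(c) Rx(b) Rz(a): a generic single-qubit unitary up to global phase.
    a, b, c = params
    rz = lambda t: np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * Z
    rx = lambda t: np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X
    return rz(c) @ rx(b) @ rz(a)

def cost(params, target):
    # 0 when the template matches the target exactly (up to phase).
    return 1 - abs(np.trace(target.conj().T @ circuit(params))) / target.shape[0]

target = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # Hadamard
res = minimize(cost, x0=np.full(3, 0.1), args=(target,), method="BFGS")
print(res.fun)  # approaches 0: no search over structures was needed
```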
Improving Quantum Circuit Synthesis with Machine Learning
In the Noisy Intermediate Scale Quantum (NISQ) era, finding implementations
of quantum algorithms that minimize the number of expensive and error prone
multi-qubit gates is vital to ensure computations produce meaningful outputs.
Unitary synthesis, the process of finding a quantum circuit that implements
some target unitary matrix, is able to solve this problem optimally in many
cases. However, current bottom-up unitary synthesis algorithms are limited by
their exponentially growing run times. We show how applying machine learning to
unitary datasets permits drastic speedups for synthesis algorithms. This paper
presents QSeed, a seeded synthesis algorithm that employs a learned model to
quickly propose resource-efficient circuit implementations of unitaries. QSeed
maintains low gate counts and offers a speedup in synthesis time
over the state of the art for a 64-qubit modular exponentiation circuit, a core
component in Shor's factoring algorithm. QSeed's performance improvements also
generalize to families of circuits not seen during the training process.
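A minimal sketch of the seeding idea, under toy assumptions: `predict_seed` is a hypothetical stand-in for the learned model, and warm-starting the same numerical optimizer from its proposal typically needs far fewer evaluations than a cold start.

```python
# Toy seeded synthesis: a "learned" seed near a known solution lets the
# optimizer converge in fewer evaluations than a cold start. The model
# and its output here are illustrative stand-ins, not QSeed's.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
rot = lambda t, P: np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * P

def template(p):                       # Rz-Rx-Rz single-qubit template
    return rot(p[2], Z) @ rot(p[1], X) @ rot(p[0], Z)

def cost(p, target):                   # phase-invariant unitary distance
    return 1 - abs(np.trace(target.conj().T @ template(p))) / 2

def predict_seed(_unitary):            # hypothetical learned model: returns
    return np.array([np.pi / 2] * 3) + 0.05  # params near a known solution

target = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # Hadamard
cold = minimize(cost, np.full(3, 0.1), args=(target,), method="BFGS")
warm = minimize(cost, predict_seed(target), args=(target,), method="BFGS")
print(cold.nfev, "evaluations cold vs", warm.nfev, "seeded")
```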
Classical Optimizers for Noisy Intermediate-Scale Quantum Devices
We present a collection of optimizers tuned for use on Noisy
Intermediate-Scale Quantum (NISQ) devices. Optimizers have a range of
applications in quantum computing, including the Variational Quantum
Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA).
They are also used for calibration tasks, hyperparameter tuning, machine
learning, and more. We analyze the efficiency and effectiveness of different
optimizers in a VQE case study. VQE is a hybrid algorithm, with a classical
minimizer step driving the next evaluation on the quantum processor. While most
results to date have concentrated on tuning the quantum VQE circuit, we show that,
in the presence of quantum noise, the classical minimizer step needs to be
carefully chosen to obtain correct results. We explore state-of-the-art
gradient-free optimizers capable of handling noisy, black-box cost functions
and stress-test them using a quantum circuit simulation environment with noise
injection capabilities on individual gates. Our results indicate that
specifically tuned optimizers are crucial to obtaining valid science results on
NISQ hardware, and will likely remain necessary even for future fault-tolerant
circuits.
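As a toy illustration of why the minimizer choice matters, the sketch below hands a shot-noisy, black-box energy estimate to SciPy's gradient-free COBYLA. The one-qubit ansatz and Hamiltonian are assumed stand-ins, not the paper's benchmarks.

```python
# VQE-style loop on a toy problem: minimize <Z> over Ry(theta)|0>, with
# the energy estimated from a finite number of noisy "shots".
import numpy as np
from scipy.optimize import minimize

Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
rng = np.random.default_rng(0)

def noisy_energy(theta, shots=1000):
    # |psi> = Ry(theta)|0>; sample Z-basis outcomes like a real device.
    ry = np.cos(theta[0] / 2) * np.eye(2) - 1j * np.sin(theta[0] / 2) * Y
    p0 = abs((ry @ np.array([1, 0]))[0]) ** 2      # P(outcome +1)
    samples = rng.choice([1, -1], size=shots, p=[p0, 1 - p0])
    return samples.mean()                          # shot noise ~ 1/sqrt(shots)

# COBYLA is gradient-free, so shot noise does not corrupt finite-difference
# gradients the way it would for BFGS-style minimizers.
res = minimize(noisy_energy, x0=[0.3], method="COBYLA")
print(res.x, res.fun)   # expect theta near pi, energy near -1
```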
Scheduling Dynamic Parallelism On Accelerators
Resource management on accelerator-based systems is complicated by the disjoint nature of the main CPU and accelerator, which involves separate memory hierarchies, different degrees of parallelism, and a relatively high cost of communication between them. For applications with irregular parallelism, where work is dynamically created based on other computations, the accelerators may both consume and produce work. To maintain load balance, the accelerators hand work back to the CPU to be scheduled. In this paper we consider multiple approaches to such scheduling problems and use the Cell BE system to demonstrate the different schedulers and the trade-offs between them. Our evaluation is done with both microbenchmarks and two bioinformatics applications (PBPI and RAxML). Our baseline approach uses a standard Linux scheduler on the CPU, possibly with more than one process per CPU. We then consider the addition of cooperative scheduling to the Linux kernel and a user-level work-stealing approach. The two cooperative approaches decrease SPE idle time by 30% and 70%, respectively, relative to the baseline scheduler. In both cases we believe the changes required to application-level codes, e.g., a program written with MPI processes that use accelerator-based compute nodes, are reasonable. The kernel-level approach provides more generality and ease of implementation, but often less performance than the work-stealing approach.
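For illustration, here is a minimal, generic Python sketch of the user-level work-stealing scheme (not the Cell BE implementation): each worker pushes and pops its own deque at one end, and idle workers steal from the opposite end of a victim's deque.

```python
# Generic work-stealing sketch: owners work LIFO on their own deque for
# locality; thieves steal FIFO, taking the oldest (often largest) task.
import collections
import random
import threading

class WorkStealingPool:
    def __init__(self, n_workers):
        self.deques = [collections.deque() for _ in range(n_workers)]
        self.locks = [threading.Lock() for _ in range(n_workers)]

    def push(self, worker, task):
        with self.locks[worker]:
            self.deques[worker].append(task)       # own end

    def pop_or_steal(self, worker):
        with self.locks[worker]:
            if self.deques[worker]:
                return self.deques[worker].pop()   # newest own task
        victims = [v for v in range(len(self.deques)) if v != worker]
        random.shuffle(victims)                    # randomized victim choice
        for v in victims:
            with self.locks[v]:
                if self.deques[v]:
                    return self.deques[v].popleft()  # steal oldest task
        return None                                # nothing to do: idle
```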
Maximizing Communication Overlap with Dynamic Program Analysis
We present a dynamic program analysis approach to optimize communication overlap in scientific applications. Our tool instruments the code to generate a trace of the application's memory and synchronization behavior. An offline analysis determines the optimal program points for maximal overlap when considering several programming constructs: non-blocking one-sided communication operations, non-blocking collectives, and bespoke synchronization patterns and operations. Feedback about possible transformations is presented to the user, and the tool can perform the directed transformations, which are supported by a lightweight runtime. The value of our approach comes from: 1) the ability to optimize across boundaries of software modules or libraries, while specializing for the intrinsics of the underlying communication runtime; and 2) providing upper bounds on the expected performance improvements after communication optimizations. We have reduced the time spent in communication by as much as 64% for several applications that were already aggressively optimized for overlap; this indicates that manual optimizations leave untapped performance. Although demonstrated mainly for the UPC programming language, the methodology can be easily adapted to any other communication and synchronization API.
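As a language-neutral illustration of the transformation the tool directs, the sketch below uses mpi4py (a stand-in; the paper targets UPC) to split a blocking send into an early initiation and a late completion so that independent computation overlaps the transfer.

```python
# Overlap sketch: initiate communication early, complete it late, and do
# independent work in between. Run with: mpiexec -n 2 python overlap.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.arange(1_000_000, dtype=np.float64)
work = np.ones(1_000_000)

if rank == 0:
    # Before: comm.Send(buf, dest=1); result = work.sum()   (no overlap)
    req = comm.Isend(buf, dest=1)   # initiate at the earliest safe point
    result = work.sum()             # independent computation overlaps the send
    req.Wait()                      # complete at the latest safe point
elif rank == 1:
    recv = np.empty_like(buf)
    req = comm.Irecv(recv, source=0)
    result = work.sum()
    req.Wait()
```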
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
Transformer-based models, such as BERT and ViT, have achieved
state-of-the-art results across different natural language processing (NLP) and
computer vision (CV) tasks. However, these models are extremely memory
intensive during their fine-tuning process, making them difficult to deploy on
GPUs with limited memory resources. To address this issue, we introduce a new
tool called SlimFit that reduces the memory requirements of these models by
dynamically analyzing their training dynamics and freezing less-contributory
layers during fine-tuning. The layers to freeze are chosen using a runtime
inter-layer scheduling algorithm. SlimFit adopts quantization and pruning for
particular layers to balance the load of dynamic activations and to minimize
the memory footprint of static activations, where static activations refer to
those that cannot be discarded regardless of freezing. This allows SlimFit to
freeze up to 95% of layers and reduce the overall on-device GPU memory usage of
transformer-based models such as ViT and BERT by an average of 2.2x, across
different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10,
CIFAR-100, and ImageNet, with an average accuracy degradation of 0.2%. For
such NLP and CV tasks, SlimFit can reduce total on-device memory usage by up to
3.1x, with an accuracy degradation of at most 0.4%. As a result, while
fine-tuning ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128
requires three and two 32GB GPUs, respectively, SlimFit enables their
fine-tuning on a single 32GB GPU without significant accuracy degradation.
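As a rough illustration of dynamics-driven freezing, the PyTorch sketch below ranks layers by a simple gradient-norm score and freezes the least-contributory fraction. The scoring heuristic and function names are illustrative assumptions, not SlimFit's actual inter-layer scheduling algorithm.

```python
# Toy dynamics-driven freezing: after a fine-tuning step, freeze the
# layers whose parameters changed least (smallest gradient norms).
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(12)])

def freeze_least_contributory(model, frac=0.5):
    scores = []
    for i, layer in enumerate(model):
        g = sum(p.grad.norm().item() for p in layer.parameters()
                if p.grad is not None)
        scores.append((g, i))                      # (update magnitude, index)
    scores.sort()                                  # least-contributory first
    for _, i in scores[:int(frac * len(scores))]:
        for p in model[i].parameters():
            p.requires_grad_(False)                # frozen for later steps

# One fine-tuning step, then freeze half of the layers.
x, y = torch.randn(32, 128), torch.randn(32, 128)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
freeze_least_contributory(model, frac=0.5)
```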